ABSTRACT

In this paper, an attempt has been made to synthesize some of the current thinking in the
area of criterion-referenced testing as well as to provide the beginning of an integration of
theory and method for such testing. Since criterion-referenced testing is viewed from a
decision-theoretic point of view, approaches to reliability and validity estimation
consistent with this philosophy are suggested. Also, to improve the decision-making accuracy
of criterion-referenced tests, a Bayesian procedure for estimating true mastery scores has been
proposed. This Bayesian procedure uses information about other members of a student's group
(collateral information), but the resulting estimation is still criterion-referenced rather than
norm-referenced in that the student is compared to a standard rather than to other students. In
theory, the Bayesian procedure increases the "effective length" of the test by improving the
reliability, the validity, and, more importantly, the decision-making accuracy of the
criterion-referenced test scores.
Over the years, standard procedures for constructing, administering, and analyzing tests, and
interpreting scores in the context of standard instructional models and methods have become
well-known to educators. With these models, tests have been used primarily and most
successfully to estimate each examinee's ability level and to permit comparative statements
(e.g., ranking) across examinees. Recently, however, there have been numerous suggestions for,
and demonstrations of, instructional models and methods in the schools where the well-known
classical mental test models for test construction and test score interpretation appear to be
less useful. Examples of these instructional models include: Computer-Assisted Instruction
(Atkinson, 1968; Suppes, 1966), Individually Prescribed Instruction (Glaser, 1968), Project
PLAN (Flanagan, 1967, 1969), and A Model of School Learning (Carroll, 1963, 1970; Bloom, 1968;
Block, 1971). Common to most of these instructional models as well as to several others are
such features as the specification of the curriculum in terms of behavioral objectives,
detailed diagnosis of beginning students, the availability of multiple instructional modes,
individual pacing and sequencing of material, and the careful monitoring of student progress.

While not all educators agree on the usefulness of these instructional models in the schools,
the position taken in this paper is that these models are useful, and that their usefulness
will be enhanced by developing testing methods and decision procedures specifically designed
for use within the context of these models. The purpose of this paper is to outline some
appropriate statistical methods that may prove of use in making instructional decisions for
students.

It appears that much of the discussion in this area (for example, see Block, 1971; Carver,
1970; Ebel, 1971; and Glaser & Nitko, 1971) stems from different understandings as to the
basic purpose of testing in these instructional models. It would seem to us that in most cases
the pertinent question is whether or not the individual examinee has attained some prescribed
degree of competence on an instructional performance task (see, for example, Harris, 1972b).
Questions of precise achievement levels and comparisons among individuals on these levels seem
to be largely irrelevant. In many of the new instructional models, tests are used to determine
on which instructional objectives an examinee has met the acceptable performance level
standard set by the model designer. This test information is usually used immediately to
evaluate the student's mastery of the instructional objectives covered in the test, so as to
appropriately locate him for his next instruction (Glaser & Nitko, 1971). Tests especially
designed for this particular purpose have come to be known as criterion-referenced tests.
Criterion-referenced tests are specifically designed to meet the measurement needs of the new
instructional models. In contrast, the better known norm-referenced tests are principally
designed to produce test scores suitable for ranking individuals on the ability measured by
the test. Sometimes this occurs with the understanding that some cut-off score will be
introduced to reject some percentage of students for the next level of instruction.
Criterion-Referenced Tests: Definitions and Selected Issues

A "criterion-referenced test" has been defined in a multitude of ways in the literature. (See,
for example, Glaser & Nitko, 1971; Harris & Stewart, 1971; Ivens, 1970; Kriewall, 1969; and
Livingston, 1972a.) The definitions are sufficiently different that a test may be classified
as norm-referenced according to one definition, criterion-referenced according to another, or,
more typically, exhibit characteristics of each to a greater or lesser extent depending on the
definition. The intentionally most restrictive definition of a criterion-referenced test was
proposed by Harris and Stewart (1971): "A pure criterion-referenced test is one consisting of
a sample of production tasks drawn from a well-defined population of performances, a sample
that may be used to estimate the proportion of performances in that population at which the
student can succeed." On the other hand, possibly the least restrictive definition is that by
Ivens (1970), who defined a criterion-referenced test as "one made up of items keyed to a set
of behavioral objectives." A very flexible definition has been proposed by Glaser and Nitko
(1971): "A criterion-referenced test is one that is deliberately constructed so as to yield
measurements that are directly interpretable in terms of specified performance standards."
According to Glaser and Nitko, "The performance standards are usually specified by defining
some domain of tasks that the student should perform. Representative samples of tasks from
this domain are organized into a test. Measurements are taken and are used to make a statement
about the performance of each individual relative to that domain." This definition is less
restrictive than Harris and Stewart's in that it does not limit consideration to a single
instructional objective. A common thread running through the various approaches to
criterion-referenced tests is that the definition of a well-specified content domain and the
development of procedures for generating appropriate samples of test items are important. (For
more on this, see Bormuth, 1970; Glaser & Nitko, 1971; and Hively, Patterson, & Page, 1968.)

It should be noted that these are also concerns of those interested in constructing
norm-referenced tests; however, not to the same extent. Less often is there an interest in
making inferences about which particular skills an individual has or does not have from his
performance on a norm-referenced test. Thus, norm-referenced testing is seldom diagnostic.
Primary examples would be the Scholastic Aptitude Test (SAT) and, to a lesser extent, the ACT
Assessment. Exceptions would be tests such as the Iowa Tests of Basic Skills, which have
important features of both norm- and criterion-referenced tests. Such tests are
norm-referenced because they are geared to reporting how well a student compared with others
in certain well-defined populations (e.g., through percentile scores). Yet, they are
criterion-referenced in that they are keyed to specific instructional objectives, are
multiscaled, and diagnostic. However, they do not involve a priori judgment as to acceptable
performance levels and a consequent judgment as to whether or not an individual student
attains this performance level. Further distinctions between norm-referenced tests and
criterion-referenced tests have been presented by Block (1971), Ebel (1971), Glaser (1963),
Glaser and Nitko (1971), Hambleton and Gorth (1971), Hieronymus (1972), and Popham and Husek
(1969).

If one accepts the Glaser and Nitko definition of a criterion-referenced test, it is apparent
that the test may often be multidimensional while made up of unidimensional subscales. That
is, the items from a criterion-referenced test are organized in distinct and different
subscales of homogeneous items measuring common skills. (The possibility of a single-item
subscale is not ruled out.) An instructional decision for each individual is then often made
on the basis of his performance on each subscale. Major interest may, thus, rest on the
reliability and validity of subscale scores.

One of the problems yet to be reckoned with for criterion-referenced tests is an instance of
the bandwidth-fidelity issue (Cronbach & Gleser, 1965). When the total testing time is fixed
and there is interest in measuring many competencies, one may be faced with the problem of
whether to obtain very precise information about a small number of competencies or less
precise information about many more competencies. Time allocation algorithms (analytical
procedures for deciding how many items on a test should measure each objective) of a rather
different kind than those presented by Woodbury and Novick (1968) and Jackson and Novick
(1970) will be required. They will be closer in spirit, but not identical, to those given by
Cronbach and Gleser (1965). The problem of how to fix the length of each subscale so as to
maximize the percentage of correct decisions or some similar measure of overall
decision-making accuracy on the basis of test results has yet to be resolved or, indeed, to be
formulated satisfactorily.



Distinction among Testing Instruments, Measurement, and Decisions



Some clarification concerning appropriate measurement models for these new instructional
programs can be obtained by properly distinguishing between testing instruments and
measurement. With the availability of a test theory for norm-referenced measurement (e.g., see
Lord & Novick, 1968), we have procedures for constructing appropriate measuring instruments,
i.e., norm-referenced tests. Then, the pertinent question seems to be whether or not the
instructional models which require different kinds of measurements (i.e., criterion-referenced
measurement) also require new kinds of tests or whether the usual tests with alternate
procedures for interpreting test scores can be used. We subscribe to the belief that different
tests are needed, constructed to meet quite different specifications than those typically set
for norm-referenced tests (Glaser, 1963). We do not propose, however, to explicate a developed
theory of criterion-referenced measurement in this paper nor to prescribe a technology for
criterion-referenced test development. Such explication should be based both on a
well-developed instructional theory and on a decision-theoretic formulation of the measurement
problem. Only the latter is even touched on here. The test development technology would be
concerned primarily with methods of obtaining a representative sample of behaviors from a
specified domain.

It should be noted that a norm-referenced test can be used for criterion-referenced
measurement, albeit with some difficulty, since the selection of items is such that many
objectives will very likely not be covered on the test or, at best, will be covered with only
a few items. A criterion-referenced test constructed by procedures especially designed to
facilitate criterion-referenced measurement can be, and sometimes is, used to make
norm-referenced measurements. However, a criterion-referenced test is not constructed
specifically to maximize the variability of test scores (whereas a norm-referenced test is).
Thus, since the distribution of scores on a criterion-referenced test will tend to be
homogeneous, it is obvious that such a test will be less useful for ordering individuals on
the measured ability. In summary, then, a norm-referenced test can be used to make
criterion-referenced measurements, and a criterion-referenced test can be used to make
norm-referenced measurements, but neither usage will be particularly satisfactory.

Thus it may be misleading to talk about tests as either norm-referenced or
criterion-referenced, since measurements obtained from either testing instrument can be
explained with a norm-referenced interpretation, a criterion-referenced interpretation, or
both. The important distinction, we believe, is between norm-referenced measurement and
criterion-referenced measurement. This distinction was made by Glaser (1963) but seems to have
been ignored by several subsequent writers.



Decision-Theoretic Approach to Criterion-Referenced Measurement

Our own conceptual framework for criterion-referenced measurement goes this way. Like Cronbach
and Gleser (1965), we see testing as a decision-theoretic process. One of the main differences
between norm-referenced tests and criterion-referenced tests is in terms of the kinds of
decisions they are specifically designed to make. Norm-referenced measurement is particularly
useful in situations where one is interested in "fixed-quota" selection or ranking of
individuals on some ability continuum. Criterion-referenced measurement involves what Cronbach
and Gleser (1965) would call a "quota-free" selection problem. That is, there is no quota on
the number of individuals who can exceed the cut-off score or threshold on a
criterion-referenced test. A cut-off score is set for each subscale of a criterion-referenced
test to separate examinees into two mutually exclusive groups. One group is made up of
examinees with high enough test scores (≥ the cut-off score) to infer they have mastered the
material to a desired level of proficiency. The second group is made up of examinees who did
not achieve the minimum proficiency standard. At this stage of the development of a theory of
criterion-referenced measurement, the establishment of cut-off scores is primarily a value
judgment. Much research might usefully be undertaken to provide guidelines for this judgment.
The educational goal is, of course, to have everyone achieve the standards. This is attempted
by means such as individualizing instruction to the point of providing multiple instructional
modes (Cronbach, 1967), individual pacing and sequencing, as well as providing various
remedial programs.

The primary problem in the new instructional models, such as individually prescribed
instruction, is one of determining if $\pi_i$, the student's mastery level, is greater than a
specified standard $\pi_0$. Here, $\pi_i$ is the "true" score for an individual $i$ in some
particular well-specified content domain. It may represent the proportion of items in the
domain he could answer successfully. Since we cannot administer all items in the domain, we
sample some small number to obtain an estimate of $\pi_i$, represented as $\hat{\pi}_i$. The
value of $\pi_0$ is the somewhat arbitrary threshold score used to divide individuals into the
two categories described earlier, i.e., Masters and Nonmasters.
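As a minimal sketch of the estimate just described, assuming the sampled items are scored 0 or
1, the proportion-correct estimate of $\pi_i$ and the resulting classification might be
written as follows. The function names and the item responses are hypothetical and serve only
to illustrate the idea; treating a score exactly at the threshold as mastery is also an
assumption.

```python
def proportion_correct(item_scores):
    """Estimate pi_i as the share of sampled items answered correctly (scores are 0 or 1)."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    return sum(item_scores) / len(item_scores)


def classify(pi_hat, pi_0):
    """Return 'master' when the estimate reaches the threshold pi_0, else 'nonmaster'.

    Counting a score exactly equal to pi_0 as mastery is an assumption of this sketch.
    """
    return "master" if pi_hat >= pi_0 else "nonmaster"


# Hypothetical responses of one student to 8 items sampled from the domain.
scores = [1, 1, 0, 1, 1, 1, 0, 1]
pi_hat = proportion_correct(scores)      # 6 of 8 items correct
label = classify(pi_hat, pi_0=0.65)      # compare with an illustrative threshold
print(pi_hat, label)
```

With so few items, a single response more or less moves the estimate by .125, which is why
the classification can be unstable near the threshold.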

Basically then, the examiner's problem is to locate each examinee in the correct category.
There are two kinds of errors that occur in this classification problem: false positives and
false negatives. A false-positive error occurs when the examiner estimates an examinee's
ability to be above the cutting score when, in fact, it is not. A false-negative error occurs
when the examiner estimates an examinee's ability to be below the cutting score when the
reverse is true. The seriousness of making a false-positive error depends to some extent on
the structure of the instructional objectives. It would seem that this kind of error has the
most serious effect on program efficiency when the instructional objectives are hierarchical
in nature. On the other hand, the seriousness of making a false-negative error would seem to
depend on the length of time a student would be assigned to a remedial program because of his
low test performance. (Other factors would be the cost of materials, teacher time, facilities,
etc.) The minimization of expected loss would then depend, in the usual way, on the specified
losses and the probabilities of incorrect classification. This is then a straightforward
exercise in the minimization of what we would call threshold loss.

In an attempt to view the above discussion in a more formal manner, suppose we take some
cutting score, $\pi_0$, and define a parameter $\omega$ such that

$$\omega = 1 \ \text{if}\ \pi \ge \pi_0, \qquad \omega = 0 \ \text{if}\ \pi < \pi_0 .$$

Now if we obtain an estimate $\hat{\pi}_i$ of $\pi_i$, then an estimate of $\omega$ can be
obtained in the following way:

$$\hat{\omega} = 1 \ \text{if}\ \hat{\pi}_i \ge \pi_0, \qquad \hat{\omega} = 0 \ \text{if}\ \hat{\pi}_i < \pi_0 .$$

Defining our error of estimation as $(\hat{\omega} - \omega)$, it is clear that the error
takes on one of three values, $+1$, $-1$, $0$, corresponding to whether we make a
false-positive error, a false-negative error, or a correct classification. Also, note that the
squares of the errors and their absolute values are identical. Thus, any procedure that
minimizes squared-error loss (SEL) in the $\omega$ metric also minimizes absolute-error loss
(AEL) in that metric. Furthermore, the minimization of SEL and AEL in the $\omega$ metric is
equivalent to the minimization of threshold loss for $\pi$ in the special case where the
losses associated with false positives and false negatives are equal. The criterion-referenced
measurement problem is, thus, one of determining an estimator $\hat{\omega}$ of $\omega$ by
determining an estimator of $\pi$ with a threshold loss function and converting this to an
estimate of $\omega$. We shall exemplify this process shortly. Note that with threshold loss,
the estimate of $\pi$ is not a single number but one of two intervals, $[0, \pi_0)$ or
$[\pi_0, 1]$. It might well be argued that what we describe here is not "measurement" at all;
and, in fact, it might be useful to avoid use of the term measurement in the above context.
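The coding of errors in the $\omega$ metric, and the way a threshold-loss decision weighs the
two kinds of misclassification, can be sketched in a few lines. This is only an illustration
under stated assumptions: the posterior probability that $\pi \ge \pi_0$ is taken as given
(how it is obtained is not shown), the function names are hypothetical, and the handling of an
exact tie between the two expected losses is an arbitrary choice. The values a = 8, b = 1, and
.85 are those of the numerical example discussed in this section.

```python
def omega(pi, pi_0):
    """Mastery indicator: 1 if pi is at or above the cutting score pi_0, else 0."""
    return 1 if pi >= pi_0 else 0


def error_type(omega_hat, omega_true):
    """Name the classification outcome from the error e = omega_hat - omega_true."""
    e = omega_hat - omega_true
    return {1: "false positive", -1: "false negative", 0: "correct"}[e]


def threshold_loss_decision(p_master, a, b):
    """Choose omega_hat (1 = master, 0 = nonmaster) with the smaller expected loss.

    p_master : assumed posterior Prob(pi >= pi_0 | data)
    a        : loss attached to a false positive
    b        : loss attached to a false negative
    A tie is resolved as nonmaster, which is an assumption of this sketch.
    """
    loss_if_master = a * (1 - p_master)    # risk of a false positive
    loss_if_nonmaster = b * p_master       # risk of a false negative
    return 1 if loss_if_master < loss_if_nonmaster else 0


# Squared and absolute errors coincide because e is always -1, 0, or +1.
assert all(e ** 2 == abs(e) for e in (-1, 0, 1))

outcome = error_type(omega(0.70, 0.65), omega(0.60, 0.65))   # estimate above, truth below
unequal = threshold_loss_decision(0.85, a=8, b=1)            # false positives cost far more
equal = threshold_loss_decision(0.85, a=1, b=1)              # equal losses
print(outcome, unequal, equal)
```

With equal losses the rule reduces to choosing the more probable category, which is the
special case in which threshold loss agrees with squared-error loss in the $\omega$ metric.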

The following example will illustrate an application of threshold loss. To estimate a person's
$\pi$ value under threshold loss, first write down the losses associated with the two kinds of
incorrect decisions. Writing $e = \hat{\omega} - \omega$ for the error of estimation, we take

$$C(e) = 0 \ \text{if}\ e = 0, \qquad C(e) = a > 0 \ \text{if}\ e = +1, \qquad C(e) = b > 0 \ \text{if}\ e = -1 .$$

The expected loss if we set $\hat{\omega} = 1$ is

$$a\,[\mathrm{Prob}(\pi < \pi_0 \mid \text{data})], \tag{1}$$

and if we set $\hat{\omega} = 0$, it is

$$b\,[\mathrm{Prob}(\pi \ge \pi_0 \mid \text{data})]. \tag{2}$$

Thus, we set $\hat{\omega} = 1$ or $0$ depending upon whether expression (1) or expression (2)
is the smaller. This decision corresponds to estimating with threshold loss whether
$\pi \ge \pi_0$ or $\pi < \pi_0$. Note, however, that we may decide that $\hat{\omega} = 0$
($\pi < \pi_0$) not because
$\mathrm{Prob}(\pi < \pi_0 \mid \text{data}) > \mathrm{Prob}(\pi \ge \pi_0 \mid \text{data})$
but because $a$ is very much larger than $b$, i.e., the loss associated with a false positive
is very much greater than that associated with a false negative.

Suppose we judge the loss associated with a false positive to be $a = 8$ "units" and the loss
associated with a false negative to be $b = 1$ unit. Further, suppose that given the data
$\mathrm{Prob}(\pi \ge \pi_0 \mid \text{data}) = .85$ and, hence,
$\mathrm{Prob}(\pi < \pi_0 \mid \text{data}) = .15$. Then the value of (1) is

$$a\,[\mathrm{Prob}(\pi < \pi_0 \mid \text{data})] = (8)(.15) = 1.2,$$

and the value of (2) is

$$b\,[\mathrm{Prob}(\pi \ge \pi_0 \mid \text{data})] = (1)(.85) = .85 .$$

Hence, we take $\hat{\omega} = 0$ and classify the student as a nonmaster. Now, notice that the
comparison of (1) and (2) is equivalent to the comparison of the ratio $a/b$ to the ratio

$$\frac{\mathrm{Prob}(\pi \ge \pi_0 \mid \text{data})}{1 - \mathrm{Prob}(\pi \ge \pi_0 \mid \text{data})} .$$

This spotlights the fact that the educator need not stipulate $a$ and $b$ in any absolute
value. He need only stipulate the ratio $a/b$. In this example, since
$\mathrm{Prob}(\pi \ge \pi_0 \mid \text{data}) = .85$, the student will be classified as a
nonmaster unless the ratio $a/b < 5.67$. Generally, with $a$ and $b$ as given, a student will
be classified as a master only if $\mathrm{Prob}(\pi \ge \pi_0 \mid \text{data}) > .89$,
approximately.

It should be noted that the above approach generalizes quite easily to situations where there
are possibly several different treatments, several relevant levels of mastery on each skill,
and several different prerequisite skills. Details of such situations will be given elsewhere.

Bayesian Estimation of Mastery Scores

In order to determine if an examinee has mastered a particular skill (i.e., instructional
objective), we analyze his responses to items on a criterion-referenced test designed to
measure that skill. These items plus the items designed to measure achievement of other skills
are organized together to form a criterion-referenced test.

Each student is assumed to have some mastery score, $\pi_i$, which may be the proportion of
items in the domain he can answer correctly. The measurement problem is to estimate $\pi_i$
from some usually small number of test items. Typically, a student's mastery score is
estimated to be his proportion-correct score. Mastery scores are estimated for the purpose of
decision-making: If $\hat{\pi}_i \ge \pi_0$, the student is sent on to new work; otherwise,
with $\hat{\pi}_i < \pi_0$, he is assigned some remedial work. Before presenting a Bayesian
solution to the mastery assessment problem, let us consider the problem of estimating a single
student's true score $\pi_i$.

Generally, the method of using the proportion-correct as an estimate of $\pi_i$ is not
entirely satisfactory when the number of items on which the proportion is based is few and
when there are many students. In situations where one is interested in estimating many
parameters, some, by chance, will be substantially overestimated and others, underestimated.
The implication of this is that many errors of classification will be made. In estimation or
in making mastery decisions on the basis of small amounts of information, we run the risk of
making many errors. What is the solution? Because of the extensive amount of testing taking
place, it is usually impractical to consider lengthening the test. However, a Bayesian
estimation procedure proposed by Novick, Lewis, and Jackson (1972) provides, at least
theoretically, a way of obtaining more information on each examinee without requiring the
administration of any additional test items. According to Novick et al. (1972), this can be
done by using not only the direct information provided by a student's (subscale) score, but
also the collateral information contained in the test data of other students. (Another
possibility, worthy of further research, is that of using the student's other subscale scores
and previous history as collateral information.)

A familiar example of how this can be done comes from the application of classical test theory
(Lord & Novick, 1968) to norm-referenced measurement. Within the classical test theory model,
each examinee's observed score $x$ on a test may be used as an estimate of his true score
$\tau$. The standard deviation of error scores across examinees in the population (standard
error of measurement) will be $\sigma_x (1 - \rho_{xx'})^{1/2}$, where $\sigma_x$ is the
standard deviation of observed scores, and $\rho_{xx'}$ is the reliability of the test. This
formula provides a measure of the inaccuracy, on the average, of using the observed score as
an estimate of true score. An alternative method of estimating true score is to use a
regression estimate $\hat{\tau} = x\rho_{xx'} + \mu_x (1 - \rho_{xx'})$, where $\mu_x$ is the
mean observed score in the population of examinees. It can be shown that the average error in
the population obtained by using $\hat{\tau}$ as an estimator of $\tau$ is
$\sigma_x \rho_{xx'}^{1/2} (1 - \rho_{xx'})^{1/2}$. This is called the standard error of
estimation. By comparing formulas, it is easily seen that the standard error of estimation is
smaller than the standard error of measurement and is substantially smaller than the latter
when $\rho_{xx'}$ is low. This is because, in effect, we are using information about the group
of which the individual is a member to provide "prior" information for the Bayesian estimation
of each person's true mastery score. With this approach, under common circumstances, the
Bayesian method can effect an increase of precision equivalent to that which would be obtained
by adding between 6 and 12 items to the test (see Novick, Lewis, & Jackson, 1972). Thus, the
Bayesian method has something substantial to offer in the context of norm-referenced
measurement problems, and similarly, it would seem that the same potential exists with
criterion-referenced testing problems.
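The comparison of the two standard errors can be checked numerically. The sketch below simply
evaluates the two classical formulas given above; the function names and the values of the
standard deviation and reliability are hypothetical, chosen only to show that the ratio of the
two errors equals the square root of the reliability.

```python
import math


def standard_error_of_measurement(sd_x, reliability):
    """Average error when the observed score itself estimates the true score."""
    return sd_x * math.sqrt(1 - reliability)


def standard_error_of_estimation(sd_x, reliability):
    """Average error when the regression estimate is used instead."""
    return sd_x * math.sqrt(reliability) * math.sqrt(1 - reliability)


# Hypothetical short test: proportion-correct scores with SD .20 and reliability .50.
sd_x, reliability = 0.20, 0.50
sem = standard_error_of_measurement(sd_x, reliability)
see = standard_error_of_estimation(sd_x, reliability)

# The ratio see / sem equals sqrt(reliability), so the gain grows as reliability falls.
print(round(sem, 4), round(see, 4), round(see / sem, 4))
```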

However, it should be noted that our previous discussion has stressed that the threshold-loss
estimates will be required. The estimates obtained by Novick, Lewis, and Jackson (1972) were
based on a zero-one loss function, and thus, a modification of the Novick, Lewis, and Jackson
method would be desirable. At present, cumbersome numerical methods would be required to
obtain a solution.

One example that rather dramatically illustrates the effect of the Bayesian estimation
procedure is the following. Suppose we administer a criterion-referenced test to a group of
examinees before and after instruction. Let us limit ourselves to the problem of estimating
mastery scores on a particular objective for the group of examinees on the two test occasions.
Suppose that the tests are short, and hence, probably have only moderate reliability. Suppose
further that the mean pretest and posttest scores are .4 and .8, respectively, and the
threshold score is .65. Now a student with a proportion-correct score of .7 on the pretest
would under the usual procedure be allowed to skip that particular unit of instruction.
However, chances are that this student's mastery score is overestimated. The Bayesian analysis
might well decide that he was a nonmaster. Speaking loosely and with respect to a
squared-error loss method, the Bayesian analysis might regress his estimated score further
toward the mean than the cutting score and, thus, assign him to take instruction on the skill.

Consider now a student with a proportion-correct score of .6 on the posttest. Here the
Bayesian analysis could be such that his "estimated score," in effect, exceeds .65. Then,
instead of being assigned to some remedial program, he will be allowed to go on to new work.
However, if his posttest group had a mean performance of .68, he would probably be estimated
to be a nonmaster.
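The three cases above can be made concrete with the classical regression estimate, used here
only as a loose stand-in for the Bayesian analysis (the text itself speaks "loosely and with
respect to a squared-error loss method"). The reliability of .50 is an assumption chosen to
represent a short test of moderate reliability; it does not come from the paper, and the
function name is hypothetical. The scores, group means, and threshold are those of the
example.

```python
def regressed_score(x, group_mean, reliability):
    """Pull an observed score toward its group mean in proportion to unreliability."""
    return reliability * x + (1 - reliability) * group_mean


THRESHOLD = 0.65
RELIABILITY = 0.50   # assumed value for a short test; not given in the source

# (observed proportion correct, group mean) for the three cases in the text.
cases = {
    "pretest .7, group mean .4": (0.70, 0.40),
    "posttest .6, group mean .8": (0.60, 0.80),
    "posttest .6, group mean .68": (0.60, 0.68),
}

results = {}
for name, (x, group_mean) in cases.items():
    estimate = regressed_score(x, group_mean, RELIABILITY)
    decision = "master" if estimate >= THRESHOLD else "nonmaster"
    results[name] = (estimate, decision)
    print(f"{name}: estimate {estimate:.2f}, {decision}")
```

Under this assumed reliability the regressed estimates are about .55, .70, and .64, so the
three decisions agree with those described in the text. A different reliability would shift
the estimates and could change the outcomes.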



Approaches to Reliability and Validity Estimation



In practical applications of criterion-referenced testing, it would seem that in order to
evaluate the test, it would be necessary to know something about the consistency of decision
making across parallel forms of the criterion-referenced test or across repeated measurements
(i.e., reliability). Another aspect of the measurement procedure that should seemingly be
considered is the accuracy of decision making (i.e., validity). The problem of reliability and
validity estimation for criterion-referenced tests is considered next.

Because the designer of a criterion-referenced test has little interest in discriminating
among examinees, no attempt is made to select items to produce a test of maximum test score
variability, and thus, that variance will typically be small. Also, criterion-referenced tests
are usually administered either immediately before or after small units of instruction. Thus,
it is not surprising that we frequently observe homogeneous distributions of test scores on
the pre- and posttests, but centered at the low and high ends of the achievement scales,
respectively. It is well known from the study of classical test theory (Lord & Novick, 1968)
that when the variance of test scores is restricted, correlational estimates of reliability
and validity will be low. Thus, it seems clear that the classical approaches to reliability
and validity estimation will need to be interpreted more cautiously (or discarded) in the
analysis of criterion-referenced tests. Perhaps an even more serious reservation concerning
the classical approach to reliability and validity estimation for criterion-referenced tests,
if one looks at these psychometric concepts in decision-theoretic terms, is that the
correlational method represents an inappropriate choice of a loss function (squared-error loss
in the $\pi$ metric) with which to evaluate a test. This point will be expanded upon later.



However, before considering a decision-theoretic approach to reliability and validity
estimation, let us review some alternate approaches proposed by other writers. Carver (1970)
argues that the reliability of any test depends upon replicability, but replicability is not
dependent upon test score variance. If a group of examinees all obtain similar scores (to
other members of the group) on parallel forms of some criterion-referenced test, near perfect
replicability exists even though test reliability, estimated using classical correlational
methods, would be close to zero. This rather extreme example points out the shortcoming of the
correlational approach to reliability estimation. Carver (1970) proposed two statistics to
assess criterion-referenced test reliability. First, he says, "The reliability of a single
form of a criterion-referenced device could be estimated by administering it to two comparable
groups. The percentage that met the criterion in one group could be compared to the percentage
that met the criterion in the other group [p. 56]." The more comparable the statistics, the
more reliable the test could be said to be. Secondly, Carver suggested that the reliability of
a criterion-referenced test should be assessed by comparing the percentage of examinees
achieving the criterion on parallel tests.
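Carver's extreme case is easy to reproduce. In the sketch below, six hypothetical examinees
take two parallel 20-item forms; the scores and the cut-off of 13 items are invented for
illustration. The per-examinee agreement index is one simple way of expressing consistency of
decisions across forms and is an assumption of this sketch, while the percentages meeting the
criterion correspond to the comparison Carver describes.

```python
def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def decision_consistency(xs, ys, cutoff):
    """Share of examinees given the same mastery decision on both forms."""
    same = sum((x >= cutoff) == (y >= cutoff) for x, y in zip(xs, ys))
    return same / len(xs)


def percent_meeting(scores, cutoff):
    """Percentage of examinees at or above the cut-off on one form."""
    return 100 * sum(s >= cutoff for s in scores) / len(scores)


# Hypothetical number-correct scores on two parallel 20-item forms.
form_a = [18, 17, 19, 18, 17, 19]
form_b = [19, 19, 19, 17, 17, 17]
CUTOFF = 13

r = pearson_r(form_a, form_b)
consistency = decision_consistency(form_a, form_b, CUTOFF)
pct_a, pct_b = percent_meeting(form_a, CUTOFF), percent_meeting(form_b, CUTOFF)
print(r, consistency, pct_a, pct_b)
```

Every examinee is classified as a master on both forms, yet the correlation between forms is
zero because the little score variance that exists is unrelated across forms.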
Cox and Graham (1966) report the use of the
coefficient of reproducibility as an alternative to
the classical approach to reliability estimation for
one special type of criterion-referenced test. They
calculate the coefficient for a sequentially scaled
criterion-referenced test designed for use in a unit
of instruction where objectives can be identified as
being sequential in nature. Tests are said to be
scalable if, for a particular ordering of items,
individuals are able to answer all questions up to a
point and none beyond. The coefficient of
reproducibility is a measure of the extent to which
student performance satisfies this condition. As Cox
(1970) suggests, the problems of using the
coefficient of reproducibility as a reliability
estimate have yet to be determined.
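
A rough sketch of a coefficient of reproducibility follows. The
error-counting rule used here (compare each response pattern with
the ideal pattern implied by its total score) is one common
convention and is an assumption; the exact computation used by Cox
and Graham is not given in this paper. The function name and the
data are hypothetical.

```python
def coefficient_of_reproducibility(responses):
    """Return 1 - errors / (examinees * items) for 0/1 item scores.

    responses: one row per examinee, items already arranged in the
    hypothesized instructional sequence (earliest objective first).
    An error is any response that differs from the ideal pattern
    "all items correct up to a point and none beyond", where the
    point is taken to be the examinee's total score (an assumption).
    """
    if not responses or not responses[0]:
        raise ValueError("responses must contain at least one item score")
    n_items = len(responses[0])
    errors = 0
    for row in responses:
        if len(row) != n_items:
            raise ValueError("all examinees must have the same number of items")
        total = sum(row)
        ideal = [1] * total + [0] * (n_items - total)
        errors += sum(1 for obs, exp in zip(row, ideal) if obs != exp)
    return 1.0 - errors / (len(responses) * n_items)


# Hypothetical data: five examinees, four sequenced items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],  # departs from the ideal pattern [1, 1, 0, 0]
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
rep = coefficient_of_reproducibility(responses)
print(round(rep, 3))
```

A value of 1.0 would mean that every examinee's responses fit the
assumed sequence exactly.
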

Another interesting suggestion for reliability
estimation comes from the work of Livingston
(1972a, 1972b). He proposes a reliability
coefficient which is based on squared deviations of
scores from the performance standard (or cutting
score) rather than the mean, as is done in the
derivation of reliability for norm-referenced tests
in classical test theory. The result is a reliability
coefficient which has several of the important
properties of a classical estimate of reliability. In
fact, it can be easily shown that the classical
reliability is simply a special case of the new
reliability coefficient. However, several
psychometricians (e.g., Harris, 1972a) have expressed
doubts concerning the usefulness of Livingston's
reliability estimate.
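
The special-case relationship can be checked numerically. The
sketch below uses the form in which Livingston's coefficient is
usually written, (r * var + (mean - C)^2) / (var + (mean - C)^2),
where r is the classical reliability, var the observed-score
variance, and C the cutting score. The formula is not reproduced
in this paper, so the expression, the function name, and the
numbers are assumptions supplied for illustration.

```python
def livingston_coefficient(reliability, variance, mean, cut_score):
    """Reliability based on squared deviations from the cutting score.

    reliability: classical reliability of the test scores
    variance:    observed-score variance
    mean:        observed-score mean
    cut_score:   performance standard (cutting score)
    """
    squared_distance = (mean - cut_score) ** 2
    denominator = variance + squared_distance
    if denominator == 0:
        raise ValueError("variance and mean-to-cut distance are both zero")
    return (reliability * variance + squared_distance) / denominator


# Hypothetical test: classical reliability .60, variance 16, mean 30.
at_mean = livingston_coefficient(0.60, 16.0, 30.0, cut_score=30.0)
off_mean = livingston_coefficient(0.60, 16.0, 30.0, cut_score=24.0)
print(round(at_mean, 4))   # cutting score at the mean
print(round(off_mean, 4))  # cutting score six points below the mean
```

When the cutting score equals the mean, the squared-distance term
vanishes and the classical reliability is recovered; as the
cutting score moves away from the mean, the coefficient rises.
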

Our own feeling is that Livingston misses the
point for much of criterion-referenced testing. It is
not, as he suggests, "to know how far [a student's]
score deviates from a fixed standard." Rather, the
problem is one of deciding whether a student's true
performance level is above or below some cutting
score. In fact, in most practical applications of
criterion-referenced tests, the test score is used to
dichotomize individuals into either a "mastery" or
a "nonmastery" category. Thus, from our
conceptualization of the measurement problem with
criterion-referenced measurement, Livingston's
choice of a loss function with which to evaluate
the reliability of a criterion-referenced test is
wrong. Specifically, we suggest that squared-error
loss in the π metric is not appropriate and that
threshold loss is appropriate.
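
The contrast between the two loss functions can be sketched as
follows. The unit costs, the "at or above the cutting score" rule,
the function names, and the example values are assumptions made
for illustration and are not taken from a formal definition in
this section.

```python
def squared_error_loss(true_level, estimate):
    """Loss grows with the squared distance between estimate and truth."""
    return (true_level - estimate) ** 2


def threshold_loss(true_level, estimate, cut_score,
                   false_positive_cost=1.0, false_negative_cost=1.0):
    """Loss is incurred only when the mastery decision is wrong.

    A student is treated as a master when the level is at or above
    the cutting score (an assumed convention).
    """
    truly_master = true_level >= cut_score
    judged_master = estimate >= cut_score
    if truly_master == judged_master:
        return 0.0
    return false_positive_cost if judged_master else false_negative_cost


CUT = 0.80  # hypothetical cutting score on the proportion-correct scale

# Large estimation error, but the mastery decision is still correct.
far_but_right = (squared_error_loss(0.95, 0.82), threshold_loss(0.95, 0.82, CUT))

# Small estimation error, but the student is wrongly called a nonmaster.
near_but_wrong = (squared_error_loss(0.81, 0.79), threshold_loss(0.81, 0.79, CUT))

print(far_but_right)
print(near_but_wrong)
```

Under squared-error loss the first case is the more costly one;
under threshold loss only the second case carries any cost.
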

Now, it may be the case that a measurement
situation will arise with the new instructional
models and a squared-error or absolute-error loss
function may be appropriate, but in such a
situation, it is unlikely that there would
simultaneously be a great concern with a threshold score.



While there has been little work done on the
problem of assessing reliability, even less work has
been reported to date on establishing the validity of
criterion-referenced test scores. Above all else, a
criterion-referenced test must have content
validity. According to Popham and Husek (1969),
content validity is determined by "a carefully
made judgment, based on the test's apparent
relevance to the behaviors legitimately inferable
from those delimited by the criterion." If
techniques such as those advocated by Hively,
Patterson, and Page (1968) or Bormuth (1970) for
defining content domains and item generation rules
are followed, content validity follows. If other
procedures are used, the task of determining
content validity becomes more difficult.

While we would suggest that the traditional
concepts of reliability and validity could be
replaced by a complete decision-theoretic
formulation, it will nevertheless be useful to point out a
relationship between these approaches. Suppose we
are given two criterion-referenced tests which in a
specified population and for a specified qualifying
score π_0 are parallel (in the classical sense; see
Lord & Novick, 1968) in the ω metric. Denote
the estimates of ω for person i on the two tests by
the observed scores ω_1i and ω_2i, and define the
reliability of the test as the correlation over
persons of ω_1i and ω_2i. This is, of course, classical
reliability theory in the ω metric. It is not
particularly satisfactory for the usual reasons that
product moment correlations are unsatisfactory
measures of association or agreement for binary
(zero-one) variables. A more satisfactory measure
of reliability might simply be the proportion of
times that the same decision would be made with
the two parallel instruments.
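
The two measures just described can be compared on a small
example. This is a sketch under stated assumptions: the function
names, the scores, and the "at or above the qualifying score"
decision rule are all hypothetical.

```python
from math import sqrt


def mastery_decisions(scores, qualifying_score):
    """Convert observed scores to zero-one mastery decisions."""
    return [1 if s >= qualifying_score else 0 for s in scores]


def proportion_agreement(decisions_1, decisions_2):
    """Proportion of persons given the same decision on both tests."""
    if not decisions_1 or len(decisions_1) != len(decisions_2):
        raise ValueError("need two equal-length, non-empty decision lists")
    same = sum(1 for a, b in zip(decisions_1, decisions_2) if a == b)
    return same / len(decisions_1)


def product_moment_correlation(x, y):
    """Pearson correlation; returns None when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical scores of ten persons on two parallel 10-item tests.
test_1 = [9, 8, 10, 9, 8, 9, 10, 8, 9, 6]
test_2 = [8, 9, 9, 10, 9, 8, 9, 9, 7, 8]
d1 = mastery_decisions(test_1, qualifying_score=8)
d2 = mastery_decisions(test_2, qualifying_score=8)

agreement = proportion_agreement(d1, d2)
correlation = product_moment_correlation(d1, d2)
print(agreement)
print(round(correlation, 3))
```

In this example the same decision is reached for eight of the ten
persons, yet the correlation of the zero-one decisions is slightly
negative, because nearly everyone is classified as a master.
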

Validity theory would take the same form,
except, of course, that a new test Y would serve as
criterion and the qualifying score on the second
test need not correspond with the qualifying score
on the predictor criterion-referenced test. The
criterion "test" might well be derived from
performance on the next unit of instruction, or it
could be a job-related performance criterion.
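
A minimal sketch of the validity analogue cross-classifies the
decisions on the predictor test against those on the criterion
measure Y, each with its own qualifying score. The function name,
the data, the two qualifying scores, and the "at or above" rule
are assumptions made for illustration.

```python
def decision_table(predictor_scores, criterion_scores,
                   predictor_cut, criterion_cut):
    """Cross-classify zero-one decisions on predictor and criterion.

    Returns (table, agreement). table maps
    (predictor_decision, criterion_decision) to a count of persons;
    agreement is the proportion of persons for whom the two
    decisions coincide.
    """
    if not predictor_scores or len(predictor_scores) != len(criterion_scores):
        raise ValueError("need two equal-length, non-empty score lists")
    table = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for x, y in zip(predictor_scores, criterion_scores):
        x_decision = 1 if x >= predictor_cut else 0
        y_decision = 1 if y >= criterion_cut else 0
        table[(x_decision, y_decision)] += 1
    agreement = (table[(1, 1)] + table[(0, 0)]) / len(predictor_scores)
    return table, agreement


# Hypothetical data: a 10-item unit test (qualifying score 8) and a
# 20-point measure of performance on the next unit (qualifying score 15).
unit_test = [9, 7, 8, 10, 6, 8, 5, 9]
next_unit = [17, 12, 14, 19, 16, 15, 10, 18]
table, agreement = decision_table(unit_test, next_unit,
                                  predictor_cut=8, criterion_cut=15)
print(table)
print(agreement)
```

The off-diagonal cells, (1, 0) and (0, 1), separate the two kinds
of decision error, which need not be equally serious in practice.
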
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